outline

 

dissimilarity- and distance-based unsupervised learning

 

independence-based vs association-based

 

taking into account continuous/categorical interactions

 

example and future work

 

dissimilarity- and distance-based unsupervised learning

learning from dissimilarities

some unsupervised learning methods take as input a dissimilarity matrix

 

dimension reduction: multidimensional scaling (MDS)1

 

clustering methods: hierarchical (HC) and partitioning around medoids (PAM)2

 

the dissimilarity measure of choice is key, obviously

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences

intuition

2 continuous and 1 categorical variables

intuition

one might consider purple and blue closer than e.g. purple and yellow

independence-based

Most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together

 

  • in the continuous case: Euclidean or Manhattan distances

 

  • in the categorical case: Hamming (matching) distance (among MANY others)

 

  • in the mixed data case: Gower dissimilarity index

 

no inter-variable relations are considered \(\rightarrow\) independence-based

independence-based

  • When variables are correlated or associated, shared information is effectively counted multiple times

  • inflated dissimilarities may distort downstream unsupervised learning tasks

independence-based

The Euclidean distance \(\longrightarrow\) shared information is over-counted

association-based

The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted

this is an association-based distance for continuous data

association-based pairwise distance

  • differences that align with the inter-variable association/correlation are down-weighted

Association-based for continuous: Mahalanobis distance

Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix; the Mahalanobis distance matrix is

\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where

  • \([\cdot]^{\odot 1/2}\) denotes the element-wise square root

  • \({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix

  • \({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
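The formulas above translate almost line by line into NumPy; the following is a minimal sketch (function and variable names are mine, not from the original):

```python
import numpy as np

def mahalanobis_distance_matrix(X):
    """Pairwise Mahalanobis distances via the Gram-matrix route above.

    X: (n, Qd) array of continuous variables.
    """
    n = X.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centering operator
    Xc = C @ X                                   # centered data C X_con
    S = np.cov(X, rowvar=False)                  # sample covariance matrix
    G = Xc @ np.linalg.inv(S) @ Xc.T             # Mahalanobis Gram matrix
    g = np.diag(G)
    D2 = g[:, None] + g[None, :] - 2.0 * G       # squared distances
    return np.sqrt(np.clip(D2, 0.0, None))       # element-wise square root

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
D = mahalanobis_distance_matrix(X)
```

Note that the centering cancels in pairwise differences, so each entry agrees with the classical pairwise Mahalanobis distance \(\sqrt{({\bf x}_i-{\bf x}_j)^{\sf T}{\bf S}^{-1}({\bf x}_i-{\bf x}_j)}\).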

Association-based for categorical: total variation distance (TVD)1

The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework2, a general way to define categorical data distances.

Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.

\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]

  • \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\) and \(q_j\) the number of categories of variable \(j\)

  • \({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).

  • setting \({\Delta}_j\) determines the categorical distance measure of choice (independent- or association-based)

Association-based for categorical: total variation distance (TVD)1 (2)

Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):

\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \ddots & \vdots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]

We refer to the conditional probability distributions for each variable \(j\) given each variable \(i\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)), stored in the block matrix

\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z). \]

where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
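As a sketch, the blocks of \({\bf P}\) and \({\bf R}\) can be assembled directly from the indicator matrices (illustrative Python; assumes every category occurs at least once, so \({\bf P}_z\) is invertible):

```python
import numpy as np

def indicator(x):
    """One-hot indicator matrix Z_j for one categorical variable."""
    cats = np.unique(x)
    return (x[:, None] == cats[None, :]).astype(float)

def conditional_blocks(X_cat):
    """Build Z, P and R = P_z^{-1}(P - P_z) from categorical data X_cat."""
    n, Qc = X_cat.shape
    Z = np.hstack([indicator(X_cat[:, j]) for j in range(Qc)])
    P = Z.T @ Z / n                     # joint proportions, block by block
    Pz = P * np.eye(P.shape[0])         # P (element-wise) I: diagonal of P only
    R = np.linalg.inv(Pz) @ (P - Pz)    # conditional distributions in blocks
    return Z, P, R

# two binary variables on six observations
X_cat = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 1]])
Z, P, R = conditional_blocks(X_cat)
```

Each row of \({\bf R}\) stacks the \(Q_c-1\) conditional distributions of the other variables given one category, so every row sums to \(Q_c-1\).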

Association-based for categorical: total variation distance (TVD)1 (3)

Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as

\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \]

where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).

The TVD-based dissimilarity matrix is, therefore,

\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
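Putting the delta framework and the TVD together, a self-contained sketch (names are mine; again assumes all categories are observed):

```python
import numpy as np

def tvd_distance(X_cat):
    """D_tvd = Z Delta^(tvd) Z^T, with Delta built from the rows of R."""
    n, Qc = X_cat.shape
    Zs = [(X_cat[:, j][:, None] == np.unique(X_cat[:, j])[None, :]).astype(float)
          for j in range(Qc)]
    sizes = [Zj.shape[1] for Zj in Zs]          # q_j for each variable
    Z = np.hstack(Zs)
    P = Z.T @ Z / n
    Pz = P * np.eye(P.shape[0])
    R = np.linalg.inv(Pz) @ (P - Pz)            # conditional distributions
    off = np.concatenate([[0], np.cumsum(sizes)])
    Delta = np.zeros((off[-1], off[-1]))        # block-diagonal Delta^(tvd)
    for j in range(Qc):
        for i in range(Qc):
            if i == j:
                continue
            Rji = R[off[j]:off[j + 1], off[i]:off[i + 1]]
            for a in range(sizes[j]):
                for b in range(sizes[j]):
                    # TVD between the two conditionals, w_ji = 1/(Qc - 1)
                    Delta[off[j] + a, off[j] + b] += (
                        0.5 * np.abs(Rji[a] - Rji[b]).sum() / (Qc - 1))
    return Z @ Delta @ Z.T

X_cat = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 1]])
D_tvd = tvd_distance(X_cat)
```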

AB for mixed?

association-based for mixed

A straightforward association-based (AB) distance for mixed data is given by the convex combination of the Mahalanobis and TVD distances:

\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]

  • this distance only accounts for correlations or associations among variables of the same type

  • no continuous–categorical interactions are considered.
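As a quick sketch of the convex combination (toy matrices, illustrative only; note that \(1-Q_d/Q = Q_c/Q\)):

```python
import numpy as np

def mixed_distance(D_mah, D_tvd, Q_d, Q_c):
    """Convex combination of the two distance matrices with weights Q_d/Q, Q_c/Q."""
    Q = Q_d + Q_c
    return (Q_d / Q) * D_mah + (Q_c / Q) * D_tvd

# two arbitrary symmetric toy matrices standing in for D_mah and D_tvd
A = np.array([[0.0, 2.0], [2.0, 0.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
D_mix = mixed_distance(A, B, Q_d=3, Q_c=1)   # 0.75 * A + 0.25 * B
```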

 

how to measure interactions?

how to measure interactions

define \(\Delta^{(int)}\), which accounts for the interactions, and use it to augment \(\Delta^{(tvd)}\)

  • the dissimilarity measure becomes

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]

where

\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \] and

\[ \tilde{\Delta} = (1-\alpha)\Delta^{(tvd)} + \alpha \Delta^{(int)} \] where \(\alpha=\frac{1}{Q_{c}}\).

how to measure interactions

What is \(\Delta^{int}\)?

  • the general entry of the \(j^{th}\) diagonal block, \(\delta_{int}^{j}(a,b)\), accounts for the interaction by measuring how well the continuous variables discriminate between the observations choosing category \(a\) and those choosing category \(b\) of the \(j^{th}\) categorical variable
  • consider the computation of \(\delta_{int}^{j}(a,b)\) as a two-class (\(a\)/\(b\)) classification problem, with the continuous variables as predictors
    • use a distance-based classifier: nearest neighbors

\(\Delta^{int}_{j}\) computation

  • consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.

  • set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)

  • for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:

  • classify the observations using the prior-corrected1 decision rule

    \[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]

  • compute \(\delta_{int}^{j}(a,b)\) as the balanced accuracy2 (average of class-wise sensitivities) \[ \Phi_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
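The per-pair computation might be sketched as follows (an illustrative helper, not the original implementation; ties in the decision rule are broken in favor of \(a\)):

```python
import numpy as np

def delta_int_pair(D, labels, a, b, pi_nn=0.1):
    """Balanced accuracy of a prior-corrected NN rule for categories a vs b.

    D: (n, n) distance matrix on the continuous variables (e.g. D_mah);
    labels: categories of the j-th categorical variable.
    """
    idx = np.where((labels == a) | (labels == b))[0]
    y = labels[idx]
    k = max(1, int(round(pi_nn * len(idx))))     # neighborhood size
    prior = {a: np.mean(y == a), b: np.mean(y == b)}
    pred = []
    for pos, i in enumerate(idx):
        order = np.argsort(D[i, idx])            # neighbors within the a/b subset
        nn = [q for q in order if q != pos][:k]  # drop the observation itself
        share_a = np.mean(y[nn] == a)
        share_b = np.mean(y[nn] == b)
        # prior-corrected decision rule
        pred.append(a if share_a / prior[a] >= share_b / prior[b] else b)
    pred = np.array(pred)
    sens_a = np.mean(pred[y == a] == a)          # class-wise sensitivities
    sens_b = np.mean(pred[y == b] == b)
    return 0.5 * (sens_a + sens_b)               # balanced accuracy
```

With two categories that are well separated along the continuous variables, the balanced accuracy approaches 1; with indistinguishable categories it hovers around 0.5.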

the balanced accuracy tells whether the two categories are well separated by the continuous variables (\(\Phi\) close to 1) or not (close to 0.5)

Building \(\Delta^{int}_{j}\)

for the general categorical variable \(j\) with \(q_{j}\) categories, compute these quantities for the \(\frac{q_j(q_j -1)}{2}\) pairs of categories

\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]

Just one-way interaction?

spectral clustering in a nutshell

Spectral clustering: a graph partitioning problem

Graph representation

a graph representation of the data matrix \(\bf X\): the aim is then to cut it into K groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(\bf w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups

      a      b      c      d
a     0      0      w_ac   0
b     0      0      w_bc   w_bd
c     w_ac   w_bc   0      w_cd
d     0      w_bd   w_cd   0

Spectral clustering: making the graph easy to cut

An approximated solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, which is a normalized version of the affinity matrix \({\bf A}\):

\[\color{dodgerblue}{\bf{L}} = {\bf D}_{r}^{-1/2}\underset{\color{grey}{\text{affinity matrix } {\bf A}}}{\color{dodgerblue}{exp(-{\bf D}^{2}(2\sigma^{2})^{-1})}}{\bf D}_{r}^{-1/2} =\color{dodgerblue}{{\bf Q}{\Lambda}{\bf Q}^{\sf T}}\]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise Euclidean distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)

  • \({\bf D}_{r}=diag({\bf r})\) , \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) is an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is a \(K\)-means applied on the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\)
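The steps above can be sketched compactly (the row normalization of \(\tilde{\bf Q}\) and the farthest-point K-means initialization are my additions, made to keep the example stable; the hand-rolled K-means keeps it self-contained):

```python
import numpy as np

def spectral_clustering(D, K, sigma, n_iter=50):
    """Spectral clustering from a pairwise distance matrix D."""
    A = np.exp(-D**2 / (2.0 * sigma**2))         # affinity matrix
    np.fill_diagonal(A, 0.0)                     # a_ii = 0
    r = A.sum(axis=1)
    Dr = np.diag(1.0 / np.sqrt(r))
    L = Dr @ A @ Dr                              # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)
    Q = vecs[:, np.argsort(vals)[::-1][:K]]      # first K eigenvectors
    Q /= np.linalg.norm(Q, axis=1, keepdims=True)
    # K-means on the rows of Q, farthest-point initialization
    centers = [Q[0]]
    for _ in range(1, K):
        d2 = np.min([((Q - c)**2).sum(axis=1) for c in centers], axis=0)
        centers.append(Q[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((Q[:, None, :] - centers[None])**2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = Q[labels == k].mean(axis=0)
    return labels

# two well-separated blobs, recovered from their Euclidean distance matrix
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(5, 0.3, (15, 2))])
D = np.sqrt(((X[:, None] - X[None])**2).sum(-1))
lab = spectral_clustering(D, K=2, sigma=1.0)
```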

Spectral clustering: solution and performance

Recent benchmarking studies1: SC works well, particularly in the case of non-convex and overlapping clusters

toy experiment

does \(\delta_{int}^{j}(a,b)\) of help?

a toy scenario

  • a latent cluster membership

  • 1 signal categorical variable partially associated with the latent clusters

  • 4 categorical variables: 1 signal, 3 noise, mutually independent

  • 2 continuous variables, unrelated to the latent clusters, but with discriminant power for some categories of the signal categorical variable

compared methods

  • Gower dissimilarity: a straightforward option

  • naive SC for mixed-type data1

  • GUDMM2: an entropy-based approach that takes into account inter- and intra-variable relations

  • unbiased association-based approach3: association-based, but no interactions considered

toy experiment

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • when the signal is limited to a few variables, retrieving information from continuous/categorical interactions may be useful

  • measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters

  • commensurability makes the by-variable contributions to the distance even

  • computationally demanding (but it can be made bearable)

  • \(\pi_{nn}\) tuning and regularization of \(\delta_{int}\)’s to reduce variability

an R package to compute distances: anydist?

an R package to compute distances: manydist! (it’s on CRAN!)

main references

Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Mousavi, E. and M. Sehhati (2023a). “A Generalized Multi-Aspect Distance Metric for Mixed-Type Data Clustering”. In: Pattern Recognition, p. 109353.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. In second-round review: Journal of Computational and Graphical Statistics.